Automatic enhancement preprocessing for segmentation of low quality cell images

We present a novel automatic preprocessing and ensemble learning technique for the segmentation of low-quality cell images. Capturing cells under intense light is challenging because they are vulnerable to light-induced cell death. Consequently, microscopic cell images tend to be of low quality, which lowers the accuracy of semantic segmentation. This problem cannot be satisfactorily solved by classical image preprocessing methods. Therefore, we propose automatic enhancement preprocessing (AEP), which translates an input image into images that are easy for deep learning to recognize. AEP is composed of two deep neural networks, and the penultimate feature maps of the first network are employed as filters to translate a low-quality input image into images that are easily classified by deep learning. Additionally, we propose automatic weighted ensemble learning (AWEL), which combines the multiple segmentation results. Because the second network predicts a segmentation result for each translated input image, the multiple segmentation results can be aggregated by automatically determining suitable weights. Experiments on two types of cell image segmentation confirmed that AEP can translate low-quality cell images into images that are easy to segment and that segmentation accuracy improves using AWEL.

Figure 2 shows an overview of AEP. AEP consists of two deep neural networks. The first network is used for semantic segmentation, and the penultimate feature maps in the first network are used as filters to translate an input image into images that are easy to segment. The number of channels for the penultimate feature maps is the same as the number of segmentation classes, and the input cell image is translated into multiple images that each emphasize one class. The second network is used to segment the images generated by the first network. The low-quality input cell image is translated by the filters, and each translated image is fed to the second network for segmentation.
Furthermore, we present automatic weighted ensemble learning (AWEL) to aggregate multiple segmented images generated by the first and second networks. Using AWEL, suitable weights are automatically determined, and the segmentation accuracy is further improved.
We conducted experiments to evaluate the proposed methods on two cell-segmentation datasets in which cell images are divided into multiple categories. The results confirmed that AEP can translate low-quality cell images into images that are easy to segment and that the segmentation accuracy improved using AWEL. Furthermore, in addition to 33, we compared AEP with various previous network architectures 15,34 and conventional preprocessing methods 23,24,35,36, and analyzed the AEP and AWEL architectures in terms of the effectiveness of AWEL, the number of translation filters, and the difference in output between the first and second networks.
The remainder of this paper is organized as follows. Section "Related work" presents related work. Section "Method" explains the proposed method in detail. Section "Experiments" presents the experiment results. Finally, we summarize the study and discuss future work in Section "Discussion".
The main contributions of this study are summarized as follows:

Image preprocessing
Image preprocessing methods include resizing, cropping, and color correction. Noise reduction is widely used for low-quality images. The most classical method for noise reduction is filtering 23,24,35. Gaussian and bilateral filters 23,24 can blur low-quality images and reduce noise, and the Sobel filter 35 can emphasize object boundaries. However, the parameters of these classical filters must be tuned manually, and the chosen parameters are sometimes unsuitable for deep learning. Super-resolution methods that use deep learning 25-32 are conceptually similar to the proposed method. Ledig et al. 27 proposed SRGAN, which is a generative adversarial network for image super-resolution. SRGAN recovered photorealistic textures from heavily downsampled images on public benchmarks and achieved impressive gains in perceptual quality. Zhang et al. 29 proposed very deep residual channel attention networks (RCAN) for image super-resolution. RCAN achieved higher accuracy and visual improvements compared with state-of-the-art image super-resolution methods. However, these methods require high-quality teacher images whose preparation is cost- and time-intensive. Recently, unsupervised super-resolution methods have been proposed 31,32, but their image quality has been insufficient. Thus, using them to preprocess low-quality microscope images is difficult. GPU memory is also a problem because conventional networks for super-resolution enlarge the images.
Furthermore, a recent study proposed a learned image resizer using deep learning 42. However, although this method is useful for image classification, it is ineffective for semantic segmentation.
Selecting a suitable preprocessing method is important for addressing the actual cause of low quality in cell images. Unlike conventional methods, our proposed method can automatically preprocess cell images and simultaneously improve segmentation accuracy.

Method

Ethics
In our study, no patient-related images were taken during the experiments. For the mouse liver cell image dataset 43, the animal protocols were reviewed and approved by the Animal Care and Use Committee of the Kyoto University Graduate School of Medicine (No. 10584), and all methods were performed in accordance with the guidelines and regulations.

Automatic enhancement preprocessing (AEP)
We propose an unsupervised image translation method that uses deep learning to make an input image more suitable for segmentation. Figure 3 shows an overview of the proposed method. First, filters for translating input images into images suitable for segmentation are generated from the penultimate feature maps of the first network for cell image segmentation. The number of channels in each generated filter is the same as that of the input image. Because the first network outputs a segmentation image, the generated filters contain useful information for segmentation and emphasize objects related to the segmentation result. In this study, we call this automatic enhancement preprocessing using deep learning. We do not require high-quality ground truths to generate filters, and the way filters are generated for an input image to improve segmentation accuracy is also learned automatically.
Given a dataset of $N$ pairs $\{x_k, y_k\}_{k=1}^{N}$ of images $x_k$ and their labels $y_k$, the translation from a low-quality input image to translated images is given in Eq. (1).
where $x_{kc}$ is the translated image, $f'_1$ is the first network used as the translation function, and $c$ indexes the translation filters. The filters generated from the penultimate feature maps of the first network are added to the input image $x_k$, and translated images $x_{kc}$ that emphasize important regions are generated. However, if the filters contain negative values, the shapes of objects in the input images may be erased. Therefore, we apply the ReLU function before the filter output to avoid negative values in the filters. The translated images are then normalized to the range from 0 to 1 using a sigmoid function because excessively large luminance values interfere with learning. Subsequently, the translated images $x_{kc}$ are fed to the second network $f_2$ for cell image segmentation. Because the number of translated images is the same as the number of translation filters, we feed each translated image to the second network $f_2$ independently, and the second network outputs multiple segmentation images. The segmentation results obtained from each translated image are different because each translated image differs from the original image. Finally, the segmented images generated by both the first network $f_1$ and the second network $f_2$ are aggregated using AWEL, and we generate the final segmentation image $z_k$ as shown in Eq. (2).
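The translation step described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the image is taken to be single-channel and already scaled to [0, 1], and `feature_maps` is a hypothetical stand-in for the penultimate feature maps of the first network (one map per class). Read literally, the description amounts to computing sigmoid(x_k + ReLU(filter_c)) for each filter c.

```python
import numpy as np

def relu(a):
    # Remove negative filter values so object shapes are not erased.
    return np.maximum(a, 0.0)

def sigmoid(a):
    # Squash the sum of image and filter back into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def translate(x, feature_maps):
    """Translate one image with C filters.

    x            : input image, shape [H, W] (assumed single-channel)
    feature_maps : hypothetical penultimate feature maps, shape [C, H, W]
    returns      : C translated images, shape [C, H, W]
    """
    filters = relu(feature_maps)
    return sigmoid(x[None, :, :] + filters)

# Toy data standing in for a cell image and three class-wise feature maps.
rng = np.random.default_rng(0)
x = rng.random((4, 4))
maps = rng.normal(size=(3, 4, 4))
translated = translate(x, maps)
```

Each of the three slices of `translated` would then be passed to the second network separately.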
We reduce the total error by aggregating the segmentation outputs. Both networks for filter generation and segmentation are simultaneously trained to generate highly accurate segmentation results.
For semantic segmentation, we use the softmax cross-entropy loss for all outputs in Eq. (3).
where $C$ is the number of categories in the dataset, $y_{kc}$ is the teacher label, and $p_{kc}$ is the probability value after a softmax function $p_i = e^{z_i} / \sum_j e^{z_j}$. Further, $z_i$ is the $i$-th element of $z$, which is an output vector of the deep neural network. Equation (4) shows the final loss function.
where $\mathrm{CELoss}_{n1}$ is the error of the first network output, $\mathrm{CELoss}_{n2}$ is the error of the outputs aggregated by AWEL, and $\mathrm{CELoss}_{n3c}$ is the error of the second network against the $c$-th translated image.
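The loss terms named above can be illustrated with a small NumPy sketch. It uses the standard per-pixel softmax cross-entropy and, as an assumption the text does not confirm, combines the three kinds of terms as a plain unweighted sum. The function names `ce_loss` and `total_loss` are hypothetical.

```python
import numpy as np

def softmax(z, axis=0):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ce_loss(logits, onehot):
    """Softmax cross-entropy averaged over pixels.

    logits, onehot : arrays of shape [C, H, W]
    """
    p = softmax(logits, axis=0)
    return float(-(onehot * np.log(p + 1e-12)).sum(axis=0).mean())

def total_loss(first_logits, ensemble_logits, second_logits_list, onehot):
    # Assumed unweighted sum of the three kinds of terms described in the text.
    loss = ce_loss(first_logits, onehot)        # first network output
    loss += ce_loss(ensemble_logits, onehot)    # output aggregated by AWEL
    loss += sum(ce_loss(l, onehot) for l in second_logits_list)  # one per translated image
    return loss

# Toy check: uniform logits over 3 classes give a loss of log(3) per term.
onehot = np.zeros((3, 2, 2))
onehot[0] = 1.0
zeros = np.zeros((3, 2, 2))
single = ce_loss(zeros, onehot)
combined = total_loss(zeros, zeros, [zeros] * 3, onehot)
```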

Automatic weighted ensemble learning (AWEL)
The aim of ensemble learning is to aggregate the multiple segmentation images generated by the first and second networks into one segmentation result to improve segmentation accuracy. Ensemble learning has two types of averaging: normal and weighted. In general, the weighted average is better if we assign large weights to important elements. However, determining suitable weight values is difficult. Therefore, we propose weighted ensemble learning, which automatically determines the weights using a 3D convolution layer.
Figure 4 shows the architecture of the weighted ensemble learning. The shape of each segmentation result of the first and second networks is [C × H × W], where H and W are the height and width of the output image, respectively, and C is the number of classes. All outputs are aggregated as [S × C × H × W], where S is the number of outputs. Here, we use a 3D convolution layer with 1 × 1 × 1 kernels, a stride of 1, and padding of 0. This is called point-wise 3D convolution. Point-wise 3D convolution operates only along the channel direction. We can apply this convolution layer to the aggregated array by treating [S] in the aggregated array as the channel direction. Therefore, we can assign a weight w_i, as in Fig. 4, to each segmentation output [C × H × W] through training, and automatically generate the final segmentation result from the [S] results.
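The arithmetic of the point-wise aggregation can be sketched as follows. A 1 × 1 × 1 3D convolution with S input channels and one output channel reduces to a weighted sum over the S axis, which is what this NumPy sketch computes. The function name `awel` and the fixed weights are illustrative assumptions; in the method the weights are learned during training.

```python
import numpy as np

def awel(outputs, weights, bias=0.0):
    """Weighted aggregation of S segmentation outputs.

    outputs : array of shape [S, C, H, W]
    weights : one scalar per output, shape [S]
    returns : aggregated result, shape [C, H, W]
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the S axis: same arithmetic as a point-wise 3D convolution
    # whose channel direction is S and which has a single output channel.
    return np.tensordot(weights, outputs, axes=(0, 0)) + bias

# Four outputs (one from the first network, three from the second), three classes.
outs = np.arange(4 * 3 * 2 * 2, dtype=float).reshape(4, 3, 2, 2)
agg = awel(outs, [0.25, 0.25, 0.25, 0.25])
```

With equal weights the result is the normal average; unequal weights give the weighted variant discussed in the text.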

Network structures
Figure 5 shows an overview of the network structures. Our networks use encoder-decoder structures. We used a lighter structure than that of the original U-Net 10 to reduce the number of calculations because we trained two types of networks simultaneously. As shown in Fig. 5, the encoder layer includes one convolution layer, batch normalization 44, ReLU activation, and dropout 36. The decoder layer includes a deconvolution layer, batch normalization, ReLU activation, and dropout. The encoder and decoder blocks consist of two encoder layers and two decoder layers, respectively. Although one encoder or decoder block consists of three convolution layers in the original U-Net, we remove one convolution layer from each encoder and decoder block, and the bottom-most block of the encoder consists of one encoder layer. The encoder network consists of one input layer and six encoder layers, and the decoder network consists of six decoder layers. Skip connections are introduced at each resolution. In the first network, the output layer consists of two convolution layers. The outputs of the first convolution layer are used to translate an input image, and the outputs of the second convolution layer are used to predict each class. In the second network, only one convolution layer is used for semantic segmentation.

Experiments

Datasets
We used 50 cell images of a mouse liver with a ground truth attached by Kyoto University 43. The ground truth image includes three labels: cytoplasm, nucleus, and membrane. The images and ground truth have a size of 256 × 256 pixels. Thirty-five images were used for training, five for validation, and the remaining 10 images for evaluation. We used 5-fold cross validation while replacing images for evaluation.
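The 35/5/10 split with 5-fold rotation can be sketched as below. How the five validation images are drawn within each fold is not stated in the text, so taking the first five of the remaining images is an assumption, as are the function name and the random seed.

```python
import numpy as np

def five_fold_splits(n_images=50, n_folds=5, n_val=5, seed=0):
    """Return a list of (train, val, test) index arrays, one per fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    folds = np.array_split(idx, n_folds)   # 5 folds of 10 images each
    splits = []
    for i in range(n_folds):
        test = folds[i]                    # 10 evaluation images
        rest = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        val, train = rest[:n_val], rest[n_val:]   # assumed: first 5 for validation
        splits.append((train, val, test))
    return splits

splits = five_fold_splits()
```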
We also evaluated our method on another cell-image dataset. We used absorbance microscopy images of human iRPE cells (iRPE dataset) 13. The ground truth includes two types of labels: background and membrane. The images were split into 1032 regions of 256 × 256 pixels and their corresponding ground truths. We randomly rearranged the images, divided the dataset in a 2:1 ratio in index order, and used the two parts as training and inference data. We divided the inference data into validation and test data (1:2) and used 3-fold cross validation while switching the training and inference data.
Additionally, we used 2D electron microscopy images of the ISBI2012 challenge (ISBI2012) 45 as a pseudo low quality dataset. This dataset is for binary segmentation of tubular structures spread over an image, i.e., cell membrane and background. We processed the original cell images in three ways to create three types of pseudo low quality cell images: (1) adding random noise, (2) changing the contrast, and (3) adding blur. For the random noise, we used Gaussian noise (µ = 0, σ = 100). For changing the image contrast, we also used Gaussian noise (µ = −100, σ = 0), and we used a Gaussian filter (kernel size = 5) to add the blur. Since the resolution of an ISBI2012 image is 512 × 512, we cropped regions of 256 × 256 pixels from the original images because of the limitation of GPU memory. The cropped areas do not overlap, and consequently, the total number of crops is 120. We randomly rearranged the images. Afterward, we divided the dataset in a 2:1 ratio in index order into training and inference data, and used 3-fold cross validation while switching the training and inference data.
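The three degradations and the tiling can be sketched in NumPy as follows. Several details are assumptions not given in the text: pixel values are treated as 8-bit intensities and clipped to [0, 255], the blur uses a Gaussian standard deviation of 1.0 with reflect padding, and all function names are hypothetical. Note that Gaussian noise with σ = 0 and µ = −100 is simply a constant offset.

```python
import numpy as np

def add_noise(img, sigma=100.0, seed=0):
    # Additive Gaussian noise with mean 0; clipping to 8-bit range is assumed.
    rng = np.random.default_rng(seed)
    return np.clip(img.astype(float) + rng.normal(0.0, sigma, img.shape), 0, 255)

def shift_contrast(img, offset=-100.0):
    # mu = -100, sigma = 0 reduces to subtracting a constant from every pixel.
    return np.clip(img.astype(float) + offset, 0, 255)

def blur(img, size=5, sigma=1.0):
    # Separable Gaussian blur with a size x size kernel (sigma is assumed).
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img.astype(float), size // 2, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def crop_tiles(img, tile=256):
    # Non-overlapping tiles; a 512 x 512 image yields four 256 x 256 crops.
    h, w = img.shape
    return [img[i:i + tile, j:j + tile]
            for i in range(0, h, tile) for j in range(0, w, tile)]

image = np.full((512, 512), 128.0)   # flat stand-in for an ISBI2012 slice
tiles = crop_tiles(image)
noisy = add_noise(tiles[0])
darker = shift_contrast(tiles[0])
blurred = blur(tiles[0][:32, :32])
```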
Figure 6 shows examples of cell images in the three datasets and their ground truths. Figure 6a shows a mouse liver cell image and its ground truth with three classes: cell nucleus (red), cell membrane (blue), and cytoplasm (green). Figure 6b shows a human iRPE cell image with two class labels: cell membrane (white) and background (black), and Fig. 6c shows an ISBI2012 image with pseudo low quality: cell membrane (white) and background (black).

Training conditions and evaluation metrics
The images were normalized between 0 and 1, and no other preprocessing was performed. The batch size for training was set to 16, and Adam (betas = 0.9, 0.999) was used for optimization. The learning rate was set to $1 \times 10^{-3}$. We trained all networks for 300 epochs, by which point the training loss had converged for all models and networks. The experiments evaluated AEP+AWEL and conventional segmentation networks 10,11,14,15,34 without preprocessing to demonstrate the effectiveness of AEP and AWEL. Furthermore, we evaluated conventional image preprocessing methods based on filters 23,24,35,36,46. All experiments were conducted using the same dataset size, optimizer, and number of epochs, and a single Nvidia GTX 1080Ti GPU was used for computation.
The segmentation accuracy of each class was evaluated using the intersection over union (IoU) and Dice score coefficient (DSC). The IoU and DSC compute the overlapping ratio between the predicted result and ground truth. Because the number of pixels in each class was different, we used the average score over classes as the final evaluation measure.
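For reference, the two metrics can be computed per class from integer label maps as in the sketch below. The function name and the choice to report NaN for a class absent from both prediction and ground truth are assumptions made for illustration.

```python
import numpy as np

def iou_and_dice(pred, target, num_classes):
    """Per-class IoU and DSC for integer label maps of equal shape."""
    ious, dices = [], []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        total = p.sum() + t.sum()
        # A class absent from both maps has no defined score (assumed NaN).
        ious.append(inter / union if union > 0 else np.nan)
        dices.append(2 * inter / total if total > 0 else np.nan)
    return np.array(ious), np.array(dices)

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
ious, dices = iou_and_dice(pred, target, num_classes=2)
mean_iou = np.nanmean(ious)    # class-averaged score used as the final measure
mean_dice = np.nanmean(dices)
```

Averaging over classes rather than over pixels keeps a small class such as the membrane from being dominated by a large one such as the background.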

Results on cell image with low quality
Comparison with conventional models
Table 1 shows the segmentation results for the mouse liver cell image dataset. We evaluated the conventional methods 10,11,14,15,34 and AEP+AWEL. The AEP+AWEL method improved the IoU by approximately 1.41% for cell nuclei and 2.95% for cell membranes compared with U-Net without preprocessing. The DSC of our method improved by approximately 1.04% for cell nuclei and 3.00% for cell membranes. The average IoU improved by approximately 1.63%, and the average DSC by approximately 1.48%. Notably, although no ground truth was used for the translated images, adequate preprocessing for segmentation was realized. This result demonstrates the effectiveness of the proposed automatic preprocessing method.
We also evaluated the proposed method using cell membrane datasets. Table 2 shows the results for the human iRPE cell images. The AEP+AWEL method improved the IoU by approximately 2.55% and the DSC by approximately 2.24% for cell membranes. The average IoU improved by approximately 1.19% and the average DSC by approximately 1.07% compared with the baseline U-Net without preprocessing. The results demonstrate that the proposed method is effective for other cell-image datasets.
Figure 7 visualizes the segmentation results for the two types of cell-image datasets. Focusing on the yellow squares in Fig. 7, the proposed method can segment cell membranes that conventional U-Net and SAUNet cannot segment well. Our method worked well even if the input images differed significantly from the previous experiment. The segmentation accuracy of the proposed method is better than those of U-Net and SAUNet without preprocessing.

Comparison with preprocessing methods
Table 3 shows the results of conventional image preprocessing methods. We evaluated the conventional filtering methods 23,24,35,36 and our automatic preprocessing method. The kernel size of all filtering methods was set to 3 × 3 and 9 × 9. As shown in Table 3, AEP achieved the best accuracy for the two types of cell image datasets.
For the mouse liver cell image dataset, although conventional preprocessing methods did not effectively improve the segmentation accuracy, AEP improved the IoU score of cell membranes and nuclei. On the iRPE cell image dataset, although the conventional filtering methods, except the bilateral filter, tended to reduce the accuracy, our preprocessing method achieved better IoUs in all classes. Figure 8 visualizes the results of image preprocessing. Focusing on the yellow squares in Fig. 8, for the mouse liver cell image dataset, it was impossible to identify cell nuclei and membranes with low brightness in the original image. When the median and Gaussian filters were used, noise in the original image was reduced, but the output images were blurred. The bilateral filter left the image quality nearly unchanged, and the Sobel filter emphasized the edges of the object too much and consequently retained its shape as a cell. However, using AEP, cell nuclei with low brightness became clear, and cell membranes, which had become similar to noise, were more clearly emphasized. Although the cell membranes in the noisy part were difficult for humans to segment, we confirmed that the cell membrane is emphasized more by the filter, and the generated filter is suitable for segmentation. The IoU on the cell membranes using AEP improved by 2.62%. For the iRPE cell image dataset, although the conventional filtering methods were minimally effective, AEP generated a preprocessed image that emphasized the cells. These results demonstrate the effectiveness of our translation filter in that the information necessary for segmentation is emphasized and unnecessary information is suppressed.

Results on cell image with pseudo low quality
Table 4 shows the segmentation results for the ISBI2012 dataset with pseudo low quality. In Table 4, "Noise" means that Gaussian noise was added, "Contrast" means that the image contrast was changed, and "Blur" means that the Gaussian filter was applied to blur the input image. We evaluated the baseline model (U-Net) and our AEP+AWEL using the IoU metric.
As shown in Table 4, AEP+AWEL improved the IoU by more than 1.00% for the cell membrane compared with U-Net. Consequently, the average IoU improved by approximately 1.66% for the noise, by approximately 1.74% for the contrast, and by approximately 1.82% for the blur. We believe that these results demonstrate the generalization performance of AEP+AWEL. Figure 9 visualizes the segmentation results for the ISBI2012 dataset with pseudo low quality. Focusing on the yellow squares in Fig. 9, some regions that originally belong to the background class are mispredicted as a result of the pseudo-degradation. However, by using AEP+AWEL, we can suppress over-detection and obtain a more accurate segmentation result. We thus confirmed the generalization performance of our proposed preprocessing method from a qualitative aspect as well.

Ablation studies
Effectiveness of AEP
Table 5 shows the results of the ablation studies for AEP. To confirm whether the penultimate feature maps are the most effective for preprocessing, we compared our proposed AEP+AWEL with a variant of AEP that uses the outputs of the first network instead of the penultimate feature maps. Furthermore, we also evaluated variants that use the outputs of the softmax layer and the argmax layer. As shown in Table 5, our proposed translation method using the penultimate feature maps achieved the best average IoU, and we consider that this is because the penultimate feature maps can get more

Validation of the number of translation filters
Figure 12 shows the results of the ablation studies on the number of translation filters for AEP. We compared settings in which the number of translation filters was double (×2), triple (×3), quadruple (×4), and quintuple (×5) the number of segmentation classes, measured by the average IoU. As shown in Fig. 12, the best IoU was obtained when we set the number of translation filters to the number of classes (×1) for both cell image datasets. The average IoU tended to decrease as the number of translation filters increased. Increasing the number of translation filters is expected to result in filters that are unrelated to each object.
Figure 13 shows the visualization results of translation filters using AEP. As shown in Fig. 13, the generated filters were the same images when we set the number of translation filters to five times the number of segmentation classes (×5). Consequently, the enhancement of each class from the segmentation results was less effective. Based on this validation, we confirm that the number of translation filters should be set equal to the number of segmentation classes.

Discussion
In general, although raw cellular images tend to be of low quality, the publicly available and easy-to-use segmentation datasets are quite clean and easy for deep learning models to train on. Low-quality cellular image datasets that can be used for segmentation are therefore very limited, and as a result, we evaluated our method on only two such datasets in this study. Furthermore, to confirm the generalization performance of our proposed method, we processed a publicly available clean cell image dataset to create three types of pseudo low quality images and evaluated the method on them. As shown in Table 4 and Fig. 9, our proposed method performs well even with pseudo low quality cellular images, which we believe demonstrates the generalization performance of the proposed method.

Conclusion
In this study, we focused on a preprocessing method for low quality cell images using deep learning, a topic that has not previously been discussed, and proposed a segmentation method using automatic preprocessing and ensemble learning. In experiments on actual cell images, we translated input images into images that are easy to segment, and the average IoU improved by approximately 1.63% compared with a segmentation network without preprocessing. In addition, the proposed method performed well on another cell image dataset. From evaluation experiments using pseudo low quality cell images, we confirmed the generalization performance of our proposed method. Although our method uses the ground truth label for training the first network, combining it with an unsupervised learning approach may add further expressiveness to the automatic preprocessing filter. This may further improve accuracy, and it is a subject for future research.

Figure 1. Examples of a cell image and its penultimate feature map. (a) Low-quality cell image as input. (b) One of the penultimate feature maps when the cell image is fed to a model based on a CNN.

Figure 2. Overview of AEP. AEP consists of two deep neural networks. The first network preprocesses images, and the second network segments the images generated by the first network.

Figure 3. Overview of AEP+AWEL architecture. When we use a segmentation dataset of three classes, we set three translation filters. Each translation filter is added to an input cell image, and we obtain three translated images. Each translated image emphasizes objects related to the segmentation result. Translated images are fed to the second network one-by-one for segmentation, and we compute the loss for all segmentation images using AWEL.

Figure 4. AWEL architecture using 3D convolution layer. In the segmentation of three classes, we prepare four segmentation outputs. The first network generates one segmentation result, and the second network generates three segmentation results. To aggregate all outputs and generate the final segmentation results, we use weighted ensemble learning. Weights are automatically determined by 3D convolution.

Figure 5. Overview of network structures. The proposed method is based on the U-Net architecture. The encoder and decoder networks consist of six layers. Each layer includes a convolution layer (Conv), batch normalization (BN), activation ReLU (ReLU), and dropout (DP). The first network (Network1) obtains segmentation results and translation filters using two output layers, and the second network (Network2) obtains segmentation results at the output layer.

Figure 6. Examples of cell images and their ground truths in three datasets. (a) shows the cell image of a mouse liver with three class labels: cell nucleus (red), cell membrane (blue), and cytoplasm (green). (b) shows a human iRPE cell image labeled as: cell membrane (white) and background (black). (c) shows the ISBI2012 dataset with pseudo low quality: cell membrane (white) and background (black).

Figure 11. (a) and (b) are the ablations on weights used by AEP.

Figure 12. Ablation on the number of translation filters for AEP architecture. The red line is the mouse liver cell image dataset, and the blue line is the human cell image dataset.

Figure 13. Visualization results of translation filters using AEP. (a) is the input image; (b) is the segmentation label; (c-e) are the filters when we set the number of translation filters to the number of classes (×1); and (f-j) are examples of filters when we quintupled (×5) the number of segmentation classes.

Table 1. Comparison between the conventional and proposed methods on the cell image dataset of mouse livers. Significant values are in bold.

Table 2. Comparison between the conventional and proposed methods on human iRPE cell images. Significant values are in bold.

Table 4. Comparison between the conventional and proposed methods on the cell image datasets with pseudo low quality. Significant values are in bold.